Note: This is a completely revised version of the article that was originally published in ACM Crossroads, Volume 13, Issue 4. Revisions were needed because of major changes to the Natural Language Toolkit project. The code in this notebook conforms to the latest version of NLTK (v3.0.1 as of March 2015) and works only with Python 3. Although the code is always tested, it is possible that a bug or two may have been introduced in the code during the course of this revision. If you find any, please report them to the author. If you are still using version 0.7 of the toolkit for some reason, please refer to http://www.acm.org/crossroads/xrds13-4/natural_language.html.
The intent of this article is to introduce readers to the area of Natural Language Processing, commonly referred to as NLP. However, rather than just describing the salient concepts of NLP, this article uses the Python programming language to illustrate them as well. For readers unfamiliar with Python, the article provides a number of references for learning how to program in Python.
The term Natural Language Processing encompasses a broad set of techniques for automated generation, manipulation and analysis of natural or human languages. Although most NLP techniques inherit largely from Linguistics and Artificial Intelligence, they are also influenced by relatively newer areas such as Machine Learning, Computational Statistics and Cognitive Science. Before we see some examples of NLP techniques, it will be useful to introduce some very basic terminology. Please note that as a side effect of keeping things simple, these definitions may not stand up to strict linguistic scrutiny.
The Python programming language is a dynamically-typed, object-oriented interpreted language. Although its primary strength lies in the ease with which it allows a programmer to rapidly prototype a project, its powerful and mature set of standard libraries makes it a great fit for large-scale, production-level software engineering projects as well. Python has a very shallow learning curve and an excellent online learning resource [11].
Although Python already has most of the functionality needed to perform simple NLP tasks, it’s still not powerful enough for most standard NLP tasks. This is where the Natural Language Toolkit (NLTK) comes in [12]. NLTK is a collection of modules and corpora, released under an open-source license, that allows students to learn and conduct research in NLP. The most important advantage of using NLTK is that it is entirely self-contained. Not only does it provide convenient functions and wrappers that can be used as building blocks for common NLP tasks, it also provides raw and pre-processed versions of standard corpora used in NLP literature and courses.
The NLTK website contains excellent documentation and tutorials for learning to use the toolkit [13]. It would be unfair to the authors, as well as to this publication, to just reproduce their words for the sake of this article. Instead, I will introduce NLTK by showing how to perform four NLP tasks, in increasing order of difficulty. Each task is either an unsolved exercise from the NLTK tutorial or a variant thereof. Therefore, the solution and analysis of each task represents original content written solely for this article.
As mentioned earlier, NLTK ships with several useful text corpora that are used widely in the NLP research community. The tasks below make use of three of them: the Gutenberg collection of classic English texts, the POS-tagged Brown corpus, and the Stopwords corpus of common English function words.
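Note that the corpora themselves are not bundled with the pip-installed NLTK package; they can be fetched once using the toolkit's built-in downloader. The cell below is a minimal setup sketch listing the resources used by the tasks in this article; if they are already installed, the calls simply report that fact.
In [ ]:
import nltk

# One-time download of the corpora used in the tasks below
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('stopwords')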
Before we begin using NLTK for our tasks, it is important to familiarize ourselves with the naming conventions used in the toolkit. The top-level package is called nltk, and we can refer to the included modules by using their fully qualified dotted names, e.g. nltk.corpus and nltk.util. The contents of any such module can then be imported into the top-level namespace by using the standard "from ... import ..." construct in Python.
NLTK is distributed with several NLP corpora, as mentioned before. We define our first task in terms of exploring one of these corpora.
Task: Use the NLTK corpus module to read the corpus austen-persuasion.txt, included in the Gutenberg corpus collection, and answer the following questions: (a) how many word tokens does this corpus contain? (b) how many unique word types does it have? and (c) what are the 10 most frequent tokens?
Besides the corpus module that allows us to access and explore the bundled corpora with ease, NLTK also provides the probability module that contains several useful classes and functions for the task of computing probability distributions. One such class is called FreqDist, and it keeps track of the sample frequencies in a distribution.
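Before turning to the task, here is a minimal sketch of the FreqDist interface on a hand-made token list (the toy sentence is made up purely for illustration).
In [ ]:
from nltk import FreqDist

# Count the samples in a tiny toy token list
toy = FreqDist(['the', 'cat', 'sat', 'on', 'the', 'mat'])
print(toy['the'])  # frequency of an individual sample: 2
print(toy.N())     # total number of samples: 6
print(toy.B())     # number of unique samples (bins): 5
print(toy.max())   # the most frequent sample: 'the'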
The next set of cells show how to use these two modules to perform the first task.
In [23]:
# first, import the gutenberg collection
from nltk.corpus import gutenberg
# which corpora are in the collection?
print(gutenberg.fileids())
In [24]:
# import FreqDist class
from nltk import FreqDist
# create frequency distribution object
fd = FreqDist()
# for each token in the relevant text, increment its counter
for word in gutenberg.words('austen-persuasion.txt'):
    fd[word] += 1
print(fd.N())  # total number of samples
In [25]:
print(fd.B())  # number of bins or unique samples
In [26]:
# Get a list of the top 10 words sorted by frequency
for word, count in fd.most_common(10):
    print(word, count)
Solution: Jane Austen’s book Persuasion contains 98171 total tokens and 6132 unique tokens. Out of these, the most common token is a comma, followed by the word the. In fact, the last part of this task is the perfect segue for one of the most interesting empirical observations about word occurrences. If we were to take a large corpus, count up the number of times each word occurs in that corpus and then list the words according to the number of occurrences (starting with the most frequent), we would be able to observe a direct relationship between the frequency of a word and its position in the list. In fact, Zipf claimed this relationship could be expressed mathematically, i.e. for any given word, f * r = k, where f is the frequency of that word, r is the rank, or the position of the word in the sorted list, and k is a constant. So, for example, the 5th most frequent word should occur exactly two times as frequently as the 10th most frequent word. In NLP literature, the above relationship is usually referred to as Zipf’s Law.
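As a quick extra check (not part of the original task), we can see how closely the Persuasion counts computed above follow this relationship by printing the product f * r for the highest-ranked words; under an idealized Zipf distribution this product would stay roughly constant.
In [ ]:
# For each of the top-ranked words, print rank, frequency and their product
for rank, (word, freq) in enumerate(fd.most_common(10), start=1):
    print(rank, word, freq, rank * freq)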
Even though the mathematical relationship prescribed by Zipf’s Law may not hold exactly, it is useful for describing how words are distributed in human languages: there are a few words that are very common, a few that occur with medium frequency, and a very large number of words that occur very rarely. It’s simple to extend the last part of Task 1 and graphically visualize this relationship using NLTK.
In [27]:
%matplotlib inline
In [28]:
import matplotlib.pyplot as plt
# Count each token in each text of the Gutenberg collection
fd = FreqDist()
for text in gutenberg.fileids():
    for word in gutenberg.words(text):
        fd[word] += 1
# Initialize two empty lists which will hold our ranks and frequencies
ranks = []
freqs = []
# Generate a (rank, frequency) point for each counted token and
# append it to the respective lists. Note that most_common() returns
# the tokens sorted by decreasing frequency.
for rank, (word, _) in enumerate(fd.most_common()):
    ranks.append(rank + 1)
    freqs.append(fd[word])
# Plot rank vs frequency on a log-log plot and show the plot
plt.loglog(ranks, freqs)
plt.xlabel('rank(r)', fontsize=14, fontweight='bold')
plt.ylabel('frequency(f)', fontsize=14, fontweight='bold')
plt.grid(True)
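As an optional follow-up sketch, we can quantify the relationship by fitting a straight line to the log-log data; this assumes that numpy is available alongside matplotlib. A slope close to -1 would correspond to the idealized form of Zipf's Law.
In [ ]:
import numpy as np

# Least-squares fit of log(frequency) against log(rank); the slope
# estimates the Zipf exponent for the Gutenberg collection
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(slope)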
Now that we have learnt how to explore a corpus, let’s define a task that can put such explorations to use.
Task: Train and build a word predictor, i.e., given a training corpus, write a program that can predict the word that follows a given word. Use this predictor to generate a random sentence of 20 words.
To build a word predictor, we first need to compute a distribution of two-word sequences over a training corpus, i.e., we need to keep a count of the occurrences of each word given the previous word as its context. Once we have computed such a distribution, we can use the input word to find a list of all the words that followed it in the training corpus and then output one of these words at random. To generate a random sentence of 20 words, all we have to do is start at the given word, predict the next word using this predictor, then the next, and so on until we have a total of 20 words. The next set of cells illustrate how to accomplish this easily using the modules provided by NLTK. We use Jane Austen’s Persuasion as the training corpus.
In [29]:
from nltk import ConditionalFreqDist
from random import choice
# Create conditional distribution object
cfd = ConditionalFreqDist()
# For each token, count current word given previous word
prev_word = None
for word in gutenberg.words('austen-persuasion.txt'):
    cfd[prev_word][word] += 1
    prev_word = word
# Start predicting at the given word, say 'therefore'
word = 'therefore'
i = 1
# Find all words that can possibly follow the current word and choose one at random
while i < 20:
    print(word, end=" ")
    lwords = list(cfd[word].keys())
    follower = choice(lwords)
    word = follower
    i += 1
Solution: The 20-word output sentence is, of course, not grammatical, but every two-word sequence in it will be, because the training corpus that we used for estimating our conditional frequency distribution is grammatical and because of the way that we estimated the conditional frequency distribution. Note that for our task we used only the previous word as the context for our predictions. It is certainly possible to use the previous two, or even three, words as the prediction context.
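As a sketch of that extension (not part of the original task), the cell below conditions on the previous two words by using a tuple as the context; the seed context is an arbitrary bigram assumed to occur in the novel.
In [ ]:
from nltk import ConditionalFreqDist
from nltk.corpus import gutenberg
from random import choice

# Count each word given the previous *two* words as its context
cfd2 = ConditionalFreqDist()
context = (None, None)
for word in gutenberg.words('austen-persuasion.txt'):
    cfd2[context][word] += 1
    context = (context[1], word)

# Generate 20 words starting from an illustrative seed context
context = ('she', 'had')
for i in range(20):
    followers = list(cfd2[context].keys())
    if not followers:
        break  # the sampled context has no observed continuation
    word = choice(followers)
    print(word, end=" ")
    context = (context[1], word)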
NLTK comes with an excellent set of modules to allow us to train and build relatively sophisticated POS taggers. However, for this task, we will restrict ourselves to a simple analysis on an already tagged corpus included with NLTK.
Task: Tokenize the included Brown Corpus and build one or more suitable data structures so that you can answer the following questions: (a) what is the most frequent POS tag in the corpus? (b) which word has the greatest number of distinct tags? (c) what is the ratio of masculine to feminine pronouns in the corpus? and (d) how many words are ambiguous, in the sense of having more than one possible POS tag?
For this task, it is important to note that there are two versions of the Brown corpus bundled with NLTK: the first is the raw corpus that we used in the last two tasks, and the second is a tagged version wherein each token of each sentence of the corpus has been annotated with the correct POS tag. Each sentence in this version of the corpus is represented as a list of 2-tuples, each of the form (token, tag). For example, a sentence like “the ball is green”, from a tagged corpus, will be represented inside NLTK as the list [('the', 'AT'), ('ball', 'NN'), ('is', 'VBZ'), ('green', 'JJ')].
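We can inspect this representation directly; the short sketch below simply prints the first five (token, tag) pairs of the first tagged sentence in the Brown corpus.
In [ ]:
from nltk.corpus import brown

# Peek at the (token, tag) representation of a tagged sentence
print(brown.tagged_sents()[0][:5])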
As explained before, the Brown corpus comprises 15 different sections, represented by the letters 'a' through 'r'. Each of the sections represents a different genre of text, and for certain NLP tasks not discussed in this article, this division proves very useful. Given this information, all we should have to do is build the data structures to analyze this tagged corpus. Looking at the kinds of questions that we need to answer, it will be sufficient to build a frequency distribution over the POS tags and a conditional frequency distribution over the tags using the tokens as the context. The next set of cells illustrate the solution for the task.
In [30]:
from nltk.corpus import brown
from nltk import FreqDist, ConditionalFreqDist
fd = FreqDist()
cfd = ConditionalFreqDist()
# for each tagged sentence in the corpus, get the (token, tag) pair and update
# both count(tag) and count(tag given token)
for sentence in brown.tagged_sents():
    for (token, tag) in sentence:
        fd[tag] += 1
        cfd[token][tag] += 1
# The most frequent tag is ...
fd.max()
Out[30]: 'NN'
In [31]:
# Initialize a list to hold (numtags,word) tuple
wordbins = []
# append each (n(unique tags for token),token) tuple to list
for token in cfd.conditions():
    wordbins.append((cfd[token].B(), token))
# sort tuples by number of unique tags (highest first)
wordbins.sort(reverse=True)
# the token with the maximum number of possible part-of-speech tags is ...
print(wordbins[0])
In [32]:
# masculine pronouns
male = ['he', 'his', 'him', 'himself']
# feminine pronouns
female = ['she', 'hers', 'her', 'herself']
# initialize counters
n_male, n_female = 0, 0
# total number of masculine samples
for m in male:
    n_male += cfd[m].N()
# total number of feminine samples
for f in female:
    n_female += cfd[f].N()
# calculate required ratio
print(float(n_male)/n_female)
In [33]:
n_ambiguous = 0
for (ntags, token) in wordbins:
    if ntags > 1:
        n_ambiguous += 1
# number of tokens with more than a single POS tag
print(n_ambiguous)
Solution: The most frequent POS tag in the Brown corpus is, unsurprisingly, the noun (NN). The word with the largest number of unique tags is, in fact, the word that. There are almost 3 times as many masculine pronouns in the corpus as feminine pronouns and, finally, there are as many as 8700 words in the corpus that can be deemed ambiguous - a number that should indicate the difficulty of the POS-tagging task.
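As an extra illustration (not part of the original task), we can list all the tags that the corpus actually assigns to this most ambiguous word, using the conditional distribution built above.
In [ ]:
# All POS tags observed for the token 'that' in the tagged Brown corpus
print(list(cfd['that'].keys()))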
The task of free word association is a very common one when it comes to psycholinguistics, especially in the context of lexical retrieval -- human subjects respond more readily to a word if it follows another highly associated word as opposed to a completely unrelated word. The instructions for performing the association are fairly straightforward -- the subject is asked for the word that immediately comes to mind upon hearing a particular word.
Task: Use a large POS-tagged text corpus to perform free word association. You may ignore function words and assume that the words to be associated are always nouns.
For this task, we will use the concept of word co-occurrences, i.e., counting the number of times words occur in close proximity with each other and then using these counts to estimate the degree of association. For each token in each sentence, we will look at all following tokens that lie within a fixed window and count their occurrences in this context using a conditional frequency distribution. The next set of cells show how we accomplish this using Python and NLTK with a window size of 5 and the POS-tagged version of the Brown corpus.
In [34]:
from nltk.corpus import brown, stopwords
# initialize a new conditional distribution
cfd = ConditionalFreqDist()
# get a list of English stopwords
stopwords_list = stopwords.words('english')
def is_noun(tag):
    return tag.lower() in ['nn', 'nns', 'nn$', 'nn-tl', 'nn+bez', 'nn+hvz',
                           'nns$', 'np', 'np$', 'np+bez', 'nps', 'nps$', 'nr',
                           'np-tl', 'nrs', 'nr$']
for sentence in brown.tagged_sents():
    for (index, tagtuple) in enumerate(sentence):
        (token, tag) = tagtuple
        token = token.lower()
        if token not in stopwords_list and is_noun(tag):
            window = sentence[index+1:index+5]
            for (window_token, window_tag) in window:
                window_token = window_token.lower()
                if window_token not in stopwords_list and is_noun(window_tag):
                    cfd[token][window_token] += 1
In [35]:
# OK, we are done! Let's start associating!
print(cfd['left'].max())
In [36]:
print(cfd['life'].max())
In [37]:
print(cfd['man'].max())
In [38]:
print(cfd['woman'].max())
In [39]:
print(cfd['boy'].max())
In [40]:
print(cfd['girl'].max())
In [41]:
print(cfd['male'].max())
In [42]:
print(cfd['ball'].max())
In [43]:
print(cfd['doctor'].max())
In [44]:
print(cfd['road'].max())
The “word associator” that we have built seems to work surprisingly well, especially when compared to the minimal amount of effort that was required. (In fact, in the context of folk psychology, our associator would almost seem to have a personality, albeit a pessimistic and misogynistic one). The results of this task should be a clear indication of the usefulness of corpus linguistics in general. As a further exercise, the association task can be easily extended in sophistication by utilizing parsed corpora and using information-theoretic measures of association [3].
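As a hedged sketch of that extension, the cell below scores the co-occurrence counts we just collected with pointwise mutual information (PMI), the measure proposed by Church and Hanks [3]. The marginal estimates and the pmi helper below are illustrative assumptions, not part of the original article.
In [ ]:
import math
from nltk import FreqDist

# Marginal counts for each noun: every co-occurrence pair contributes one
# count to each of its two members
noun_counts = FreqDist()
total_pairs = 0
for left in cfd.conditions():
    for right, count in cfd[left].items():
        noun_counts[left] += count
        noun_counts[right] += count
        total_pairs += count

def pmi(x, y):
    # Unsmoothed pointwise mutual information of the ordered pair (x, y);
    # a rough estimate meant purely for illustration
    p_xy = cfd[x][y] / total_pairs
    p_x = noun_counts[x] / (2 * total_pairs)
    p_y = noun_counts[y] / (2 * total_pairs)
    return math.log2(p_xy / (p_x * p_y))

# Re-rank the words that co-occur with 'man' by PMI instead of raw frequency
print(sorted(cfd['man'], key=lambda w: pmi('man', w), reverse=True)[:5])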
Although this article used Python and NLTK to provide an introduction to basic natural language processing, it is important to note that there are other NLP frameworks, besides NLTK, that are used by the NLP academic and industrial community. A popular example is GATE (General Architecture for Text Engineering), developed by the NLP research group at the University of Sheffield [4]. GATE is built on Java and provides, besides the framework itself, a general architecture that describes how language processing components connect to each other, and a graphical development environment. GATE is freely available and is primarily used for text mining and information extraction.
Every programming language and framework has its own strengths and weaknesses. For this article, we chose to use Python because it possesses a number of advantages over other programming languages, such as: (a) readability, (b) an easy-to-use object-oriented paradigm, (c) easy extensibility, (d) strong unicode support, and (e) a powerful standard library. It is also extremely robust and efficient and has been used in complex and large-scale NLP projects such as a state-of-the-art machine translation decoder [2].
Natural Language Processing is a very active field of research and attracts many graduate students every year. It allows a coherent study of the human language from the vantage points of several disciplines -- Linguistics, Psychology, Computer Science and Mathematics. Another, perhaps more important, reason for choosing NLP as an area of graduate study is the sheer number of very interesting problems with well-established constraints but no general solutions. For example, the original problem of machine translation, which spurred the growth of the field, remains, even after two decades of intriguing and active research, one of the hardest problems to solve. There are several other cutting-edge areas in NLP that currently draw a large amount of research activity. It would be informative to discuss a few of them here:
Automatic Multi-document Text Summarization: There are a large number of efforts underway to use computers to automatically generate coherent and informative summaries for a cluster of related documents [8]. This task is considerably more difficult compared to generating a summary for a single document because there may be redundant information present across multiple documents.
Computational Parsing: Although the problem of using probabilistic models to automatically generate syntactic structures for a given input text has been around for a long time, there are still significant improvements to be made. The most challenging task is to be able to parse, with reasonable accuracy, languages that exhibit very different linguistic properties when compared to English, such as Chinese [7] and Arabic.
Python and the Natural Language Toolkit (NLTK) allow any programmer to get acquainted with NLP tasks easily without having to spend too much time on gathering resources. This article is intended to make this task even easier by providing working examples and references for anyone interested in learning about NLP.
Nitin Madnani is a research scientist at Educational Testing Service. He was previously a Ph.D. student in the Department of Computer Science at University of Maryland, College Park and a graduate research assistant with the Institute for Advanced Computer Studies. He works in the area of statistical natural language processing, specifically paraphrasing, machine translation and text summarization. His language of choice for all tasks, big or small, is Python.
[1] Dan Bikel. 2004. On the Parameter Space of Generative Lexicalized Statistical Parsing Models. Ph.D. Thesis. http://www.cis.upenn.edu/~dbikel/papers/thesis.pdf
[2] David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. Proceedings of ACL.
[3] Kenneth W. Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics. 16(1).
[4] H. Cunningham, D. Maynard, K. Bontcheva and V. Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics.
[5] Michael Hart and Gregory Newby. Project Gutenberg. http://www.gutenberg.org/wiki/Main_Page
[6] H. Kucera and W. N. Francis. 1967. Computational Analysis of Present-Day American English. Brown University Press, Providence, RI.
[7] Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? Proceedings of ACL.
[8] Dragomir R. Radev and Kathy McKeown. 1999. Generating natural language summaries from multiple on-line sources. Computational Linguistics. 24:469-500.
[9] Adwait Ratnaparkhi. 1996. A Maximum Entropy Part-Of-Speech Tagger. Proceedings of Empirical Methods on Natural Language Processing.
[10] Dekai Wu and David Chiang. 2007. Syntax and Structure in Statistical Translation. Workshop at HLT-NAACL.
[11] The Official Python Tutorial. https://docs.python.org/3.4/tutorial/
[12] Natural Language Toolkit. http://nltk.org
[13] NLTK Book/Tutorial. http://www.nltk.org/book/